Packages were installed locally and loaded below:
library(tidyverse)
library(lubridate)
library(hms)
library(ggplot2)
library(dplyr)
This dataset titled: NYPD Shooting Incident Data, is from the open data portal. We read it into R using ‘read.csv()’ in a reproducible manner, and examine it using head().
df<-read.csv("https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv")
head(df)
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME BORO LOC_OF_OCCUR_DESC PRECINCT
## 1 231974218 08/09/2021 01:06:00 BRONX 40
## 2 177934247 04/07/2018 19:48:00 BROOKLYN 79
## 3 255028563 12/02/2022 22:57:00 BRONX OUTSIDE 47
## 4 25384540 11/19/2006 01:50:00 BROOKLYN 66
## 5 72616285 05/09/2010 01:58:00 BRONX 46
## 6 85875439 07/22/2012 21:35:00 BRONX 42
## JURISDICTION_CODE LOC_CLASSFCTN_DESC LOCATION_DESC
## 1 0
## 2 0
## 3 0 STREET GROCERY/BODEGA
## 4 0 PVT HOUSE
## 5 0 MULTI DWELL - APT BUILD
## 6 2 MULTI DWELL - PUBLIC HOUS
## STATISTICAL_MURDER_FLAG PERP_AGE_GROUP PERP_SEX PERP_RACE VIC_AGE_GROUP
## 1 false 18-24
## 2 true 25-44 M WHITE HISPANIC 25-44
## 3 false (null) (null) (null) 25-44
## 4 true UNKNOWN U UNKNOWN 18-24
## 5 true 25-44 M BLACK <18
## 6 false 18-24 M BLACK 18-24
## VIC_SEX VIC_RACE X_COORD_CD Y_COORD_CD Latitude
## 1 M BLACK 1006343 234270 40.80967
## 2 M BLACK 1000082.937500000000000 189064.671875000000000 40.68561
## 3 M BLACK 1020691 257125 40.87235
## 4 M BLACK 985107.312500000000000 173349.796875000000000 40.64249
## 5 F BLACK 1009853.500000000000000 247502.562500000000000 40.84598
## 6 M BLACK 1011046.687500000000000 239814.234375000000000 40.82488
## Longitude Lon_Lat
## 1 -73.92019 POINT (-73.92019278899994 40.80967347200004)
## 2 -73.94291 POINT (-73.94291302299996 40.685609672000055)
## 3 -73.86823 POINT (-73.868233 40.872349)
## 4 -73.99691 POINT (-73.99691224999998 40.642489932000046)
## 5 -73.90746 POINT (-73.90746098599993 40.84598358900007)
## 6 -73.90318 POINT (-73.90317908399999 40.82487781900005)
Below we change date and time types, convert boolean columns to numerical, removing columns we will not use, and making sure null values are in the same format across all columns. We will decide how to use nulls during our analysis, depending on which question we are trying to answer. We only used non null age data when creating visualization 1, and there were no nulls for visualization 2.
df <- df %>% #start of pipechain from tidyverse package - updates df with the following pipechain changes:
mutate(OCCUR_DATE = mdy(OCCUR_DATE)) %>% #changing to type date; the symbol %>% is like saying "then", telling the pipechain what to do next
mutate(OCCUR_TIME = as_hms(OCCUR_TIME)) %>% #changing type to time using HMS from lubridate HH:MM:SS
mutate(STATISTICAL_MURDER_FLAG = as.integer(as.logical(STATISTICAL_MURDER_FLAG))) %>% #changes true/false to 1/0
select(-c(INCIDENT_KEY,Y_COORD_CD,X_COORD_CD,Lon_Lat,Longitude,Latitude,LOC_CLASSFCTN_DESC,JURISDICTION_CODE)) %>% #removing columns we won't
mutate(across(where(is.character), ~ na_if(.x, "null"))) %>% #changes all null text to real null
mutate(across(where(is.character), ~ na_if(.x, "NA"))) %>% #changes all NA text to real null
mutate(across(where(is.character), ~ na_if(.x, ""))) #changes all blank text to real null
colSums(is.na(df)) #checking null values in each col
## OCCUR_DATE OCCUR_TIME BORO
## 0 0 0
## LOC_OF_OCCUR_DESC PRECINCT LOCATION_DESC
## 25596 0 14977
## STATISTICAL_MURDER_FLAG PERP_AGE_GROUP PERP_SEX
## 0 9344 9310
## PERP_RACE VIC_AGE_GROUP VIC_SEX
## 9310 0 0
## VIC_RACE
## 0
summary(df) #print summary as requested in project instructions
## OCCUR_DATE OCCUR_TIME BORO LOC_OF_OCCUR_DESC
## Min. :2006-01-01 Length:29744 Length:29744 Length:29744
## 1st Qu.:2009-10-29 Class1:hms Class :character Class :character
## Median :2014-03-25 Class2:difftime Mode :character Mode :character
## Mean :2014-10-31 Mode :numeric
## 3rd Qu.:2020-06-29
## Max. :2024-12-31
## PRECINCT LOCATION_DESC STATISTICAL_MURDER_FLAG PERP_AGE_GROUP
## Min. : 1.00 Length:29744 Min. :0.0000 Length:29744
## 1st Qu.: 44.00 Class :character 1st Qu.:0.0000 Class :character
## Median : 67.00 Mode :character Median :0.0000 Mode :character
## Mean : 65.23 Mean :0.1938
## 3rd Qu.: 81.00 3rd Qu.:0.0000
## Max. :123.00 Max. :1.0000
## PERP_SEX PERP_RACE VIC_AGE_GROUP VIC_SEX
## Length:29744 Length:29744 Length:29744 Length:29744
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## VIC_RACE
## Length:29744
## Class :character
## Mode :character
##
##
##
We are requested to add at least two visualizations and some analysis to our data.
df %>%
filter(
!PERP_AGE_GROUP %in% c("1020", "1028", "223","224", "940", "UNKNOWN", "(null)"),
!is.na(PERP_AGE_GROUP)
) %>%
count(PERP_AGE_GROUP) %>%
mutate(
pct = n / sum(n),
label = paste0(PERP_AGE_GROUP, " (", round(pct * 100, 1), "%)")
) %>%
ggplot(aes(x = "", y = pct, fill = label)) +
geom_col(width = 1) +
coord_polar(theta = "y") +
labs(title = "Perpetrator Age Group Distribution (%)", fill = "Age Group") +
theme_void()
From our first visualization, We see that the vast majority of incidents (that include a perp’s age) belong to the 18-44 age group. This age group accounts for about 83.6% of incidents. Some questions this raises to me, is that since the 45-64 age group percentage is so low, at only 4.7%, but the 25-44 age group is 40.5%, what is the real cut off in the 25-44 age group where the incidents start to become less? 35? 40? Better age data is required to see.
Here is the next visualization:
df %>%
count(BORO) %>%
ggplot(aes(x = reorder(BORO, -n), y = n, fill = BORO)) +
geom_col() +
geom_text(aes(label = n), vjust = -0.5) + #add the labels
labs(title = "Incident Count per Borough", x = "Borough", y = "Number of Incidents") +
theme_minimal() +
theme(legend.position = "none")
In the second visual, we compare the number of incidents throughout the 5 boroughs. Brooklyn is the borough with the most incidents, followed by the Bronx, Queens, Manhattan, and Staten Island. Indicating that Staten Island has the least shooting, and Brooklyn and the Bronx has the most shooting incidents. What factors makes Brooklyn so much worse?
We will look at a logistic categorical model showing how the time of day affects the likelihood of a shooting incident resulting in a murder. (Statistical_murder_flag)
df <- df %>%
filter(!is.na(STATISTICAL_MURDER_FLAG)) %>%
mutate(hour = as.factor(substr(OCCUR_TIME, 1, 2)))
model <- glm(
STATISTICAL_MURDER_FLAG ~ hour,
data = df,
family = binomial
)
summary(model)
##
## Call:
## glm(formula = STATISTICAL_MURDER_FLAG ~ hour, family = binomial,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.52699 0.05403 -28.264 < 2e-16 ***
## hour01 -0.02095 0.07769 -0.270 0.78742
## hour02 -0.02063 0.08073 -0.256 0.79829
## hour03 -0.04470 0.08358 -0.535 0.59274
## hour04 0.16651 0.08319 2.002 0.04534 *
## hour05 0.38284 0.10004 3.827 0.00013 ***
## hour06 0.38232 0.12742 3.001 0.00270 **
## hour07 0.69465 0.14682 4.731 2.23e-06 ***
## hour08 0.48722 0.14918 3.266 0.00109 **
## hour09 0.33153 0.15810 2.097 0.03600 *
## hour10 0.36776 0.13907 2.644 0.00818 **
## hour11 0.38490 0.12443 3.093 0.00198 **
## hour12 0.24880 0.11575 2.149 0.03160 *
## hour13 0.29313 0.10874 2.696 0.00702 **
## hour14 0.23627 0.09777 2.416 0.01567 *
## hour15 -0.03711 0.09894 -0.375 0.70759
## hour16 -0.13743 0.09686 -1.419 0.15597
## hour17 0.23096 0.08929 2.587 0.00969 **
## hour18 0.27194 0.08397 3.239 0.00120 **
## hour19 0.06647 0.08342 0.797 0.42552
## hour20 -0.05346 0.08214 -0.651 0.51518
## hour21 0.11324 0.07658 1.479 0.13924
## hour22 0.19672 0.07412 2.654 0.00795 **
## hour23 -0.05737 0.07608 -0.754 0.45080
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 29251 on 29743 degrees of freedom
## Residual deviance: 29128 on 29720 degrees of freedom
## AIC: 29176
##
## Number of Fisher Scoring iterations: 4
The coefficients show which hours of the day are more associated with murder compared to hour 0 (midnight). Hours 5, 7, 8, and 11 show the largest positive coefficients with significantly low p values. Murders are most likely from 5am to 2pm based on this data, which was a surprise, as I expected most deadly shootings to be late at night.
This was a short report where we only looked at 2 visualizations for the data. We found out that Brooklyn has the most shooting incidents, and young men account for the large majority of these incidents (gender was not included in these visualizations but it is mostly male perpetrators).
Potential biases may be, I myself am a male within the prevalent age group, yet my friends and I do not see any thing like what was reported. Then again, I do not live within the test area, which could be another potential bias. I mitigate that by realizing that different geographical locations may have different cultures or crime rates.